A data set on various variables that could impact students’ academic performance has been chosen for this report.
Three questions are answered in this report using the chosen data set.
The data set comes from Aman Chauhan on Kaggle: https://www.kaggle.com/datasets/whenamancodes/alcohol-effects-on-study?datasetId=2479552.
It is called “Alcohol Effects On Study”, but it also includes many other variables describing students.
The data set describes the performance of students in Mathematics and Portuguese classes, together with a variety of attributes of students from two Portuguese schools. These attributes include their gender, their parents’ status, their alcohol consumption, their activities, etc. We have chosen to use the maths data set only, as mathematics is studied globally whereas Portuguese is not.
| school | sex | age | address | famsize | Pstatus | Medu | Fedu |
|---|---|---|---|---|---|---|---|
| GP | F | 18 | U | GT3 | A | 4 | 4 |
| GP | F | 17 | U | GT3 | T | 1 | 1 |
| GP | F | 15 | U | LE3 | T | 1 | 1 |
| GP | F | 15 | U | GT3 | T | 4 | 2 |
| GP | F | 16 | U | GT3 | T | 3 | 3 |
| GP | M | 16 | U | LE3 | T | 4 | 3 |
| Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup |
|---|---|---|---|---|---|---|---|
| at_home | teacher | course | mother | 2 | 2 | 0 | yes |
| at_home | other | course | father | 1 | 2 | 0 | no |
| at_home | other | other | mother | 1 | 2 | 3 | yes |
| health | services | home | mother | 1 | 3 | 0 | no |
| other | other | home | father | 1 | 2 | 0 | no |
| services | other | reputation | mother | 1 | 2 | 0 | no |
| famsup | paid | activities | nursery | higher | internet | romantic | famrel |
|---|---|---|---|---|---|---|---|
| no | no | no | yes | yes | no | no | 4 |
| yes | no | no | no | yes | yes | no | 5 |
| no | yes | no | yes | yes | yes | no | 4 |
| yes | yes | yes | yes | yes | yes | yes | 3 |
| yes | yes | no | yes | yes | no | no | 4 |
| yes | yes | yes | yes | yes | yes | no | 5 |
| freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|
| 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |
| 4 | 2 | 1 | 2 | 5 | 10 | 15 | 15 | 15 |
| Variable Name | Type | Description & Sample Space |
|---|---|---|
| school | categorical | Students’ school: [GP, MS] |
| sex | categorical | Students’ sex: [F, M] |
| age | quantitative | Students’ age: [15-22] |
| address | categorical | Students’ home address type: [U, R] |
| famsize | categorical | Students’ family size: [LE3, GT3] |
| Pstatus | categorical | Parents’ cohabitation status: [T, A] |
| Medu | categorical | Mother’s education: [0:4] |
| Fedu | categorical | Father’s education: [0:4] |
| Mjob | categorical | Mother’s job |
| Fjob | categorical | Father’s job |
| reason | categorical | Why students chose the school |
| guardian | categorical | Students’ guardian |
| traveltime | categorical | Travel time from home to school: [1:4] |
| studytime | categorical | Weekly study time: [1:4] |
| failures | categorical | Number of past class failures: [0:4] |
| schoolsup | categorical | Extra educational support: [yes, no] |
| famsup | categorical | Family educational support: [yes, no] |
| paid | categorical | Extra paid classes: [yes, no] |
| activities | categorical | Extracurricular activities: [yes, no] |
| nursery | categorical | Attended nursery school: [yes, no] |
| higher | categorical | Interested in higher education: [yes, no] |
| internet | categorical | Availability of internet at home: [yes, no] |
| romantic | categorical | In a romantic relationship: [yes, no] |
| famrel | categorical | Family relationship quality: [1:5] |
| freetime | categorical | Free time after school: [1:5] |
| goout | categorical | Goes out with friends: [1:5] |
| Dalc | categorical | Alcohol consumption on workdays: [1:5] |
| Walc | categorical | Alcohol consumption on weekends: [1:5] |
| health | categorical | Student’s health: [1:5] |
| absences | quantitative | Number of school absences: 0~93 |
| G1 | quantitative | First period grade: 0~20 |
| G2 | quantitative | Second period grade: 0~20 |
| G3 | quantitative | Final grade: 0~20 |
There are 30 variables and 3 targets (G1, G2, G3). According to the author of the data:
From the histogram, other than the outliers where the grade is 0, the grades are approximately normally distributed.
Here are some other statistics on G3.
| Statistic | Value |
|---|---|
| Variable Name | G3 |
| Mean | 10.42 |
| Minimum Value | 0 |
| Q1 | 8 |
| Median Value | 11 |
| Q3 | 14 |
| Maximum Value | 20 |
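
As a rough illustration of how such a summary can be computed, the Python sketch below applies the same statistics to the six G3 values from the sample rows shown earlier. The report does not show its own code, so the language, the helper name `summarize`, and the "inclusive" quantile convention are all assumptions. Six rows will not reproduce the full-data values in the table above.

```python
import statistics

# G3 values from the six sample rows shown earlier (not the full 395 rows).
g3_sample = [6, 6, 10, 15, 10, 15]

def summarize(values):
    """Return mean, minimum, quartiles and maximum for a list of numbers."""
    # "inclusive" interpolates between observed values; this is an assumed convention.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

summary = summarize(g3_sample)
print(summary)
```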
Table 4. Summary of Data Set
| Name | maths_study |
|---|---|
| Number of rows | 395 |
| Number of columns | 33 |
| Column type frequency: | |
| character | 17 |
| numeric | 16 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| school | 0 | 1 | 2 | 2 | 0 | 2 | 0 |
| sex | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| address | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| famsize | 0 | 1 | 3 | 3 | 0 | 2 | 0 |
| Pstatus | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| Mjob | 0 | 1 | 5 | 8 | 0 | 5 | 0 |
| Fjob | 0 | 1 | 5 | 8 | 0 | 5 | 0 |
| reason | 0 | 1 | 4 | 10 | 0 | 4 | 0 |
| guardian | 0 | 1 | 5 | 6 | 0 | 3 | 0 |
| schoolsup | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| famsup | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| paid | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| activities | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| nursery | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| higher | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| internet | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| romantic | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 16.70 | 1.28 | 15 | 16 | 17 | 18 | 22 | ▇▅▅▁▁ |
| Medu | 0 | 1 | 2.75 | 1.09 | 0 | 2 | 3 | 4 | 4 | ▁▃▆▆▇ |
| Fedu | 0 | 1 | 2.52 | 1.09 | 0 | 2 | 2 | 3 | 4 | ▁▆▇▇▇ |
| traveltime | 0 | 1 | 1.45 | 0.70 | 1 | 1 | 1 | 2 | 4 | ▇▃▁▁▁ |
| studytime | 0 | 1 | 2.04 | 0.84 | 1 | 1 | 2 | 2 | 4 | ▅▇▁▂▁ |
| failures | 0 | 1 | 0.33 | 0.74 | 0 | 0 | 0 | 0 | 3 | ▇▁▁▁▁ |
| famrel | 0 | 1 | 3.94 | 0.90 | 1 | 4 | 4 | 5 | 5 | ▁▁▃▇▅ |
| freetime | 0 | 1 | 3.24 | 1.00 | 1 | 3 | 3 | 4 | 5 | ▁▃▇▆▂ |
| goout | 0 | 1 | 3.11 | 1.11 | 1 | 2 | 3 | 4 | 5 | ▂▆▇▅▃ |
| Dalc | 0 | 1 | 1.48 | 0.89 | 1 | 1 | 1 | 2 | 5 | ▇▂▁▁▁ |
| Walc | 0 | 1 | 2.29 | 1.29 | 1 | 1 | 2 | 3 | 5 | ▇▅▅▃▂ |
| health | 0 | 1 | 3.55 | 1.39 | 1 | 3 | 4 | 5 | 5 | ▂▂▅▃▇ |
| absences | 0 | 1 | 5.71 | 8.00 | 0 | 0 | 4 | 8 | 75 | ▇▁▁▁▁ |
| G1 | 0 | 1 | 10.91 | 3.32 | 3 | 8 | 11 | 13 | 19 | ▂▇▇▆▂ |
| G2 | 0 | 1 | 10.71 | 3.76 | 0 | 9 | 11 | 13 | 19 | ▁▂▇▆▂ |
| G3 | 0 | 1 | 10.42 | 4.58 | 0 | 8 | 11 | 14 | 20 | ▂▃▇▅▁ |
The data set has no missing data. Because most variables are categorical, no outlier treatment was applied. Cardinalities seem even; no irregular cardinality was noticed.
Therefore, the analysis was carried out on the data without any preprocessing.
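The kind of check summarized above (missing values and number of distinct values per column) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `profile` is hypothetical, and the rows are a four-column subset of the six sample rows shown earlier, not the full data set.

```python
# Four columns from the six sample rows shown earlier (subset chosen for brevity).
rows = [
    {"school": "GP", "sex": "F", "famsize": "GT3", "Pstatus": "A"},
    {"school": "GP", "sex": "F", "famsize": "GT3", "Pstatus": "T"},
    {"school": "GP", "sex": "F", "famsize": "LE3", "Pstatus": "T"},
    {"school": "GP", "sex": "F", "famsize": "GT3", "Pstatus": "T"},
    {"school": "GP", "sex": "F", "famsize": "GT3", "Pstatus": "T"},
    {"school": "GP", "sex": "M", "famsize": "LE3", "Pstatus": "T"},
]

def profile(rows):
    """Count missing values and distinct values for every column."""
    report = {}
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        missing = sum(1 for v in values if v is None or v == "")
        distinct = {v for v in values if v not in (None, "")}
        report[column] = {"n_missing": missing, "n_unique": len(distinct)}
    return report

column_profile = profile(rows)
for name, stats in column_profile.items():
    print(name, stats)
```

In this small sample `school` has a single distinct value, whereas the full data set contains both GP and MS.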
First question we are trying to answer: which variables have the least and the greatest impact on a student’s academic performance?
In this graph, the correlations between the grades are high, as the author of the data set stated. Therefore, rather than using all three target variables (G1, G2, and G3), we use only G3, the final grade, for the analysis.
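To make the notion of "high correlation between grades" concrete, here is a minimal Python sketch of the Pearson correlation coefficient. It uses the G1, G2 and G3 values of the six sample rows shown earlier, so the numbers it prints are not the correlations of the full data set. The function name `pearson` is an assumption for illustration.

```python
import math

# Grades from the six sample rows shown earlier (not the full data set).
g1 = [5, 5, 7, 15, 6, 15]
g2 = [6, 5, 8, 14, 10, 15]
g3 = [6, 6, 10, 15, 10, 15]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r_g1_g3 = pearson(g1, g3)
r_g2_g3 = pearson(g2, g3)
print(round(r_g1_g3, 3), round(r_g2_g3, 3))
```

When two predictors are this strongly correlated with the target and with each other, keeping only one of them, as done here with G3, avoids counting nearly the same information three times.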
In order to look at all variables, attributes are separated into 4 domains:
This shows correlations of G3 with family related variables.
This shows correlations of G3 with entertainment related variables.
This shows correlations of G3 with academic related variables such as higher/failures.
This shows correlations of G3 with other attributes of students such as age.
| school | sex | age | address | famsize | Pstatus | Medu | Fedu |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 18 | 2 | 1 | 1 | 4 | 4 |
| 1 | 1 | 17 | 2 | 1 | 2 | 1 | 1 |
| 1 | 1 | 15 | 2 | 2 | 2 | 1 | 1 |
| 1 | 1 | 15 | 2 | 1 | 2 | 4 | 2 |
| 1 | 1 | 16 | 2 | 1 | 2 | 3 | 3 |
| 1 | 2 | 16 | 2 | 2 | 2 | 4 | 3 |
| Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup |
|---|---|---|---|---|---|---|---|
| 1 | 5 | 1 | 2 | 2 | 2 | 0 | 2 |
| 1 | 3 | 1 | 1 | 1 | 2 | 0 | 1 |
| 1 | 3 | 3 | 2 | 1 | 2 | 3 | 2 |
| 2 | 4 | 2 | 2 | 1 | 3 | 0 | 1 |
| 3 | 3 | 2 | 1 | 1 | 2 | 0 | 1 |
| 4 | 3 | 4 | 2 | 1 | 2 | 0 | 1 |
| famsup | paid | activities | nursery | higher | internet | romantic | famrel |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 2 | 2 | 1 | 1 | 4 |
| 2 | 1 | 1 | 1 | 2 | 2 | 1 | 5 |
| 1 | 2 | 1 | 2 | 2 | 2 | 1 | 4 |
| 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 |
| 2 | 2 | 1 | 2 | 2 | 1 | 1 | 4 |
| 2 | 2 | 2 | 2 | 2 | 2 | 1 | 5 |
| freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|
| 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |
| 4 | 2 | 1 | 2 | 5 | 10 | 15 | 15 | 15 |
This shows correlations of G3 with family related variables such as the mother’s and father’s education. G3 has the highest correlation with the level of the mother’s education.
This shows correlations of G3 with entertainment related variables such as goout and traveltime. G3 has the highest correlation with the amount of time spent going out.
This shows correlations of G3 with academic related variables such as higher/failures. G3 has the highest correlation with the failures variable.
This shows correlations of G3 with other attributes of students such as age. Here G3 has the highest correlation with the age variable.
The data set contains lots of variables; hence, we have analyzed them with PCA.
| | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|
| school | 0.0034393 | 0.0035950 | -0.0146910 | 0.0533958 | -0.0050823 |
| sex | 0.0042130 | -0.0088727 | -0.1283385 | -0.0076321 | -0.0239502 |
| age | -0.0289796 | 0.0314500 | -0.0756241 | 0.3983161 | -0.0921604 |
| address | 0.0015917 | -0.0094165 | 0.0026950 | -0.0474724 | 0.0193215 |
| famsize | -0.0019793 | -0.0073368 | -0.0286624 | 0.0230301 | 0.0129049 |
| Pstatus | 0.0051238 | 0.0026267 | 0.0008888 | 0.0174341 | -0.0066342 |
| Medu | -0.0133557 | -0.0561666 | -0.1191241 | -0.5107038 | 0.0193417 |
| Fedu | -0.0029806 | -0.0460025 | -0.1174808 | -0.4423881 | -0.0448584 |
| Mjob | -0.0076591 | -0.0284290 | -0.1975051 | -0.4825205 | -0.0734298 |
| Fjob | -0.0008364 | -0.0126785 | -0.0868130 | -0.1217384 | 0.0135957 |
| reason | -0.0176413 | -0.0329332 | 0.1507109 | -0.0928095 | 0.2591892 |
| guardian | -0.0111539 | 0.0073716 | 0.0151965 | 0.0519465 | -0.0199573 |
| traveltime | 0.0008689 | 0.0204531 | -0.0344538 | 0.0996479 | -0.0221689 |
| studytime | 0.0069854 | -0.0273182 | 0.1639636 | -0.0271974 | -0.0092362 |
| failures | -0.0067216 | 0.0584286 | -0.0538223 | 0.1201256 | -0.0032652 |
| schoolsup | -0.0010061 | 0.0109059 | 0.0260322 | -0.0335882 | 0.0355734 |
| famsup | -0.0015418 | 0.0062674 | 0.0193152 | -0.0823382 | 0.0104148 |
| paid | -0.0004285 | -0.0089173 | -0.0070881 | -0.0455181 | 0.0635420 |
| activities | 0.0009319 | -0.0062248 | -0.0054827 | -0.0509431 | -0.0060290 |
| nursery | -0.0008824 | -0.0066744 | 0.0123192 | -0.0509694 | -0.0014797 |
| higher | 0.0016581 | -0.0086730 | 0.0089801 | -0.0291958 | 0.0003309 |
| internet | -0.0046603 | -0.0088330 | -0.0166627 | -0.0530546 | 0.0287127 |
| romantic | -0.0091878 | 0.0073550 | 0.0098075 | 0.0151318 | -0.0379199 |
| famrel | 0.0050636 | 0.0005706 | 0.0003452 | -0.0122973 | -0.1384288 |
| freetime | 0.0071714 | 0.0024534 | -0.2527927 | 0.0136479 | -0.0566476 |
| goout | -0.0070293 | 0.0394399 | -0.3925895 | 0.0615628 | 0.1144059 |
| Dalc | -0.0130810 | 0.0159337 | -0.3753694 | 0.0817900 | 0.1035887 |
| Walc | -0.0230123 | 0.0309674 | -0.6358956 | 0.1913719 | 0.2163594 |
| health | 0.0050971 | 0.0290182 | -0.2784197 | -0.0698200 | -0.6409292 |
| absences | -0.9981818 | -0.0321442 | 0.0214435 | -0.0021858 | -0.0133280 |
| G1 | 0.0204953 | -0.6473123 | 0.0128437 | 0.1528869 | -0.4902631 |
| G2 | 0.0238063 | -0.7500806 | -0.0716886 | 0.0005639 | 0.4026845 |
One interesting finding is that absences dominates PC1, with a loading of -0.998, which we interpret as a significant negative impact on G3.
Because absences had a small correlation with G3 in the ggpairs plot but a large impact in PCA, absences alone does not tell us much about students and their grades; however, when more attributes are available, the absences variable can tell us a lot more about students and their grades.
However, note that when the data set was converted into numeric values, some variables were assigned numbers that carry no meaning. For instance, the Mjob value “at_home” is assigned 1 and the other values are assigned other numbers. There is no specific order to these values; hence, tools like PCA may work poorly or give inaccurate results.
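A related caveat concerns scale rather than encoding. The report does not state whether the variables were standardized before PCA. If they were not, the column with the largest variance tends to dominate the first component, which would be consistent with absences (standard deviation 8.00, the largest in the numeric summary) receiving almost all of the PC1 loading. The Python sketch below illustrates this effect under that assumption. It uses three columns of the six sample rows shown earlier, where G1 happens to have the largest variance and so plays the role that absences may play in the full data. The helper names and the use of power iteration are choices made for this illustration only.

```python
import math
import statistics

# Columns from the six sample rows shown earlier (not the full data set).
columns = {
    "Walc": [1, 1, 3, 1, 2, 2],
    "absences": [6, 4, 10, 2, 4, 10],
    "G1": [5, 5, 7, 15, 6, 15],
}

def covariance_matrix(cols):
    """Sample covariance matrix of equally long columns."""
    n = len(cols[0])
    centered = []
    for col in cols:
        mean = statistics.fmean(col)
        centered.append([v - mean for v in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]

def first_component(cov, iterations=500):
    """Leading eigenvector of a covariance matrix via power iteration."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        vec = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

def standardize(col):
    """Rescale a column to mean 0 and standard deviation 1."""
    mean, sd = statistics.fmean(col), statistics.stdev(col)
    return [(v - mean) / sd for v in col]

names = list(columns)
raw = first_component(covariance_matrix(list(columns.values())))
scaled = first_component(
    covariance_matrix([standardize(c) for c in columns.values()])
)
raw_loadings = dict(zip(names, raw))
scaled_loadings = dict(zip(names, scaled))
print("unscaled:", {k: round(v, 3) for k, v in raw_loadings.items()})
print("standardized:", {k: round(v, 3) for k, v in scaled_loadings.items()})
```

On the unscaled columns almost all of the first component falls on the highest-variance column; after standardization the loadings reflect how the columns vary together instead.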
Second question we are trying to answer:
First, the variables affecting the availability of family support are examined.
Only the variables related to family attributes are selected.
First 6 rows of the selected table:
| famsize | Pstatus | Medu | Fedu | Mjob | Fjob | famsup | famrel | guardian | age | G3 |
|---|---|---|---|---|---|---|---|---|---|---|
| GT3 | A | 4 | 4 | at_home | teacher | no | 4 | mother | 18 | 6 |
| GT3 | T | 1 | 1 | at_home | other | yes | 5 | father | 17 | 6 |
| LE3 | T | 1 | 1 | at_home | other | no | 4 | mother | 15 | 10 |
| GT3 | T | 4 | 2 | health | services | yes | 3 | mother | 15 | 15 |
| GT3 | T | 3 | 3 | other | other | yes | 4 | father | 16 | 10 |
| LE3 | T | 4 | 3 | services | other | yes | 5 | mother | 16 | 15 |
Scatter plots of some of the variables were generated; however, since all variables except age and G3 are categorical, the scatter plots are meaningless.
As the data show, almost all variables in the data set are categorical with 5 or fewer distinct values; hence, scatter plots are almost meaningless.
For this reason, other methods are required.
The first approach was simply to observe the variables and see how they relate to each other.
| famsup | Pstatus | Mother/Father | Occupation |
|---|---|---|---|
| no | A | Mjob | at_home |
| no | A | Fjob | teacher |
| yes | T | Mjob | at_home |
| yes | T | Fjob | other |
| no | T | Mjob | at_home |
| no | T | Fjob | other |
From the visualization, the occupation “other” dominates. Generally, regardless of the parents’ occupation, there are more supportive parents than unsupportive ones.
Next, the Pstatus variable is taken into account, which tells us whether the parents live together or apart.
But because there are many more supportive parents, it is hard to observe the proportion of unsupportive parents, so the two groups are now shown separately.
The second approach was PCA, as there are so many variables.
Because PCA accepts numeric values only, all categorical or character values must be converted to numeric values.
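The conversion can be sketched as follows in Python. The report does not show how it was done, so this is an assumption: numbering the levels alphabetically from 1 reproduces the Mjob codes in the converted table shown earlier (1, 1, 1, 2, 3, 4 for the six sample rows). The sketch also shows one-hot (indicator) columns, a commonly used alternative for unordered categories that avoids imposing an order. The function names are hypothetical.

```python
# Mjob values of the six sample rows shown earlier.
mjob = ["at_home", "at_home", "at_home", "health", "other", "services"]

def label_encode(values):
    """Number the distinct levels alphabetically, starting at 1."""
    levels = sorted(set(values))
    mapping = {level: code for code, level in enumerate(levels, start=1)}
    return [mapping[v] for v in values], mapping

def one_hot(values):
    """One 0/1 indicator column per level, with no implied order."""
    levels = sorted(set(values))
    return levels, [[int(v == level) for level in levels] for v in values]

codes, mapping = label_encode(mjob)
levels, indicators = one_hot(mjob)
print(codes)
print(levels)
for row in indicators:
    print(row)
```

With integer codes, "services" (4) appears four times as large as "at_home" (1), a distance that has no meaning for job categories. Indicator columns treat every pair of levels as equally different.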
From the table, PC1 is strongly influenced by Medu, Fedu, and Mjob.
This is interesting because the plots tell us that the occupation of mothers matters a lot more than the occupation of fathers. Referring back to figure x, many mothers stay at home and can support their children fully. We expect this is one of the reasons.
Now, linear models will be used to see the trends of G3 over age.
From this visualization, notice how the mean grade at age 20 is extremely high. This is because only 3 students are 20 years old. For ages 21 and 22, there is only one student each. Hence, we ignore these 5 instances by keeping only the instances with age less than 20.
Both the mean and the median grades of students show some decrease with age.
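The filtering and grouping step described above can be sketched in Python as follows. The (age, G3) pairs are taken from the six family-attribute sample rows shown earlier, so the output is not the report's result. The function name `grade_by_age` and its `max_age` parameter are assumptions for illustration.

```python
import statistics
from collections import defaultdict

# (age, G3) pairs from the six family-attribute sample rows shown earlier.
samples = [(18, 6), (17, 6), (15, 10), (15, 15), (16, 10), (16, 15)]

def grade_by_age(pairs, max_age=20):
    """Mean and median grade per age, keeping only ages below max_age."""
    groups = defaultdict(list)
    for age, grade in pairs:
        if age < max_age:  # drops the sparsely populated ages 20 to 22
            groups[age].append(grade)
    return {
        age: (statistics.fmean(grades), statistics.median(grades))
        for age, grades in sorted(groups.items())
    }

by_age = grade_by_age(samples)
for age, (mean_grade, median_grade) in by_age.items():
    print(age, mean_grade, median_grade)
```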
Therefore, a simple linear fit also shows a decreasing trend.
The 25%, 50%, and 75% quantiles all show a decreasing trend.
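For reference, a simple linear fit of G3 on age amounts to an ordinary least-squares slope and intercept. The Python sketch below computes both for the six sample rows shown earlier. It is an illustration only: the function name `linear_fit` is an assumption, and six rows say nothing reliable about the full data set.

```python
# Ages and final grades from the six family-attribute sample rows shown earlier.
ages = [18, 17, 15, 15, 16, 16]
grades = [6, 6, 10, 15, 10, 15]

def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = linear_fit(ages, grades)
print(round(slope, 3), round(intercept, 3))
```

A negative slope corresponds to the decreasing trend described above.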
Figures x and y show that the grades of students with no family support available decrease more steeply with age. Figure y also tells us that, in the lower 25%, students with family support tend to do better with age, whereas those without support keep decreasing in grades.
Earlier, we noticed that the occupation of mothers matters significantly more than the occupation of fathers. Hence, color is mapped to the mothers’ occupation to see why that is the case.
From both figures x and y, we can observe that students whose mothers stay at home get better grades over time. This is expected, as mothers staying at home are able to spend more time with their children and care for them more.
Last question we are trying to answer: What is the correlation between alcohol consumption and variables that impact academic performance such as hours studying, number of absences, number of class failures?
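Since alcohol consumption and study time are recorded on ordinal scales (1 to 5 and 1 to 4), a rank-based measure such as Spearman’s correlation is one reasonable way to quantify their association. The report does not say which measure it used, so the Python sketch below is an assumption. It uses the Walc and studytime values of the six sample rows shown earlier, with average ranks for ties, and its output should not be read as a result for the full data set.

```python
import math

# Walc and studytime from the six sample rows shown earlier.
walc = [1, 1, 3, 1, 2, 2]
studytime = [2, 2, 2, 3, 2, 2]

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

rho = spearman(walc, studytime)
print(round(rho, 3))
```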
Here are the significant data visualizations generated in order to answer question 2.
Here are the significant data visualizations generated in order to answer question 3.
Student absences have the greatest negative impact on grades, and no single variable has an overtly positive effect on grades. Note that absences tell us more about the students as more attributes become available.
The family life of a student does have an impact on their grades. For instance, the availability of support from parents had an impact on grades. Hence, we can conclude that family related attributes affect students’ academic performance.
Based on the visualizations, alcohol consumption does have an effect on study time but not on absences, and study time has an impact on academic performance: increased alcohol consumption is associated with studying less. Therefore, we can conclude that alcohol consumption has an impact on grades.
In conclusion, many factors are involved in students’ academic performance. It can vary based on many different attributes!